Discourse for Machine Translation
Abstract
Statistical Machine Translation (SMT) is a modern success: given a source language sentence, SMT finds the most probable target language sentence, based on (1) properties of the source; (2) probabilistic source-target mappings at the level of words, phrases and/or sub-structures; and (3) properties of the target language. SMT translates individual sentences because the search space even for a single sentence can be vast. But sentences are parts of texts, and texts have properties beyond those of their individual sentences, including:
• document-wide properties, such as style, register, reading level and genre, that are visible in the frequency and distribution of words, word senses, referential forms and syntactic structures;
• patterns of topical or functional sub-structures, which mean that the frequencies and distributions of words, word senses, referential forms and syntactic structures vary across a text;
• relations between clauses or between referring expressions, signaled explicitly or implicitly, that reflect a text's coherence;
• frequent appeal to reduced expressions that rely on context to convey their message efficiently.
Recognizing and deploying these properties promises to improve both fluency and accuracy in SMT, i.e., whether the sequence of sentences in the target text conveys the same information as those in its source, in as readable a manner. This presentation describes how researchers are attempting to do this, without bringing translation to a halt.
Related resources
Towards a discourse relation-aware approach for Chinese-English machine translation
Translation of discourse relations is one of the recent efforts of incorporating discourse information to statistical machine translation (SMT). While existing works focus on disambiguation of ambiguous discourse connectives, or transformation of discourse trees, only explicit discourse relations are tackled. A greater challenge exists in machine translation of Chinese, since implicit discourse...
Disambiguating Temporal–Contrastive Discourse Connectives for Machine Translation
Temporal–contrastive discourse connectives (although, while, since, etc.) signal various types of relations between clauses such as temporal, contrast, concession and cause. They are often ambiguous and therefore difficult to translate from one language to another. We discuss several new and translation-oriented experiments for the disambiguation of a specific subset of discourse connectives in...
Discourse-level features for statistical machine translation
The talk will show how the disambiguation of discourse connectives can improve their automatic translation. Connectives are a class of frequent functional lexical items that play an important role in text readability and coherence. Longer-range context is taken into account to learn the signaled rhetorical relations. The labels obtained from a discourse connective classifier are then integrated...
Machine Translation, Ten Years On: Discourse has yet to make a breakthrough
As already mentioned, most machine translation systems perform translation sentence by sentence. But even in the case of paragraph translation, the discourse structure of the target text tends to be identical to that of the source text. However, the sublanguage discourse structures may differ across the different languages, and thus a translated text which assumes the same discourse structure a...
Assessing the Discourse Factors that Influence the Quality of Machine Translation
We present a study of aspects of discourse structure — specifically discourse devices used to organize information in a sentence — that significantly impact the quality of machine translation. Our analysis is based on manual evaluations of translations of news from Chinese and Arabic to English. We find that there is a particularly strong mismatch in the notion of what constitutes a sentence in...
Implicitation of Discourse Connectives in (Machine) Translation
Explicit discourse connectives in a source language text are not always translated to comparable words or phrases in the target language. The paper provides a corpus analysis and a method for semi-automatic detection of such cases. Results show that discourse connectives are not translated into comparable forms (or even any form at all) in up to 18% of human reference translations from English...